Characterization of the complete chloroplast genome sequences of six Dalbergia species and its comparative analysis in the subfamily of Papilionoideae (Fabaceae)

Dalbergia spp. are numerous and widely distributed in pantropical areas in Asia, Africa and America, and most of the species have important economic and ecological value as precious timber. In this study, we determined and characterized six complete chloroplast genomes of Dalbergia species (Dalbergia obtusifolia, D. hupeana, D. mimosoides, D. sissoo, D. hancei, D. balansae), which displayed the typical quadripartite structure of angiosperms. The sizes of the genomes ranged from 155,698 bp (D. hancei) to 156,419 bp (D. obtusifolia). The complete chloroplast genomes of Dalbergia include 37 tRNA genes, eight rRNA genes and 84 protein-coding genes. We analysed the sequence diversity of Dalberigia chloroplast genomes coupled with previous reports. The results showed 12 noncoding regions (rps16-accD, trnR-UCU-trnG-UCC, ndhE-ndhG, trnG-UCC-psbZ, rps8-rpl14, trnP-UGG-psaJ, ndhH-rps15, trnQ-UUG-rps16, trnS-GCU-psbI, rps12-clpP, psbA-trnK-UUU, trnK-UUU-intron), and four coding regions (rps16, ycf1, rps15 and ndhF) showed many nucleotide variations that could be used as potential molecular markers. Based on a site-specific model, we analysed the selective pressure of chloroplast genes in Dalbergia species. Twenty-two genes with positively selected sites were detected, involving the photosynthetic system (ndhC, adhD, ndhF, petB, psaA, psaB, psbB, psbC, psbK and rbcL), self-replication category of genes (rpoA, rpoC2, rps3, rps12 and rps18) and others (accD, ccsA, cemA, clpP, matK, ycf1 and ycf2). Additionally, we identified potential RNA editing sites that were relatively conserved in the genus Dalbergia. Furthermore, the comparative analysis of cp genomes of Dalbergieae species indicated that the boundary of IRs/SSC was highly variable, which resulted in the size variation of cp genomes. Finally, phylogenetic analysis showed an inferred phylogenetic tree of Papilionoideae species with high bootstrap support and suggested that Amorpheae was the sister of the clade Dalbergieae. Moreover, three genera of the Pterocarpus clade showed a nested evolutionary relationship. These complete cp genomes provided valuable information for understanding the genetic variation and phylogenetic relationship of Dalbergia species with their relatives.


INTRODUCTION
Chloroplasts (cp) play a vital role in the photosynthetic cells of plants and algae. They contain a double-stranded genome with an independent replicated system, and it is generally recognized that chloroplasts may evolve from cyanobacteria through endosymbiosis (Daniell et al., 2016;Mcfadden & Dooren, 2004). The structure of the cp genome is relatively conserved in plants, including two small inverted repeats (IRa and IRb), a long single-copy (LSC) region and a short single-copy (SSC) region (Palmer & Jeffrey, 1985). The size of the cp genome ranges from 107 kb to 218 kb. However, a variety of mutations have occurred in the cp genome during a long-term evolutionary process, which can help us understand how plants evolved. Therefore, the mutation in the cp genome is used for the study of evolution, phylogeny and phytogeography in plants (Sancho et al., 2017;Yang et al., 2013).
The genus Dalbergia belongs to the Fabaceae family and includes over 250 species distributed in pantropical areas in Asia, Africa and America (Klitgaard & Lavin, 2005). Many Dalbergia species are known as high-value rosewood because of their decorative and excellent wood quality, and their heartwoods with a wide range of colour variation have been used to produce luxury furniture and musical instruments (Bhagwat et al., 2015). Therefore, these species have been suffering from fast-evolving threats that have led to a major decline in wild populations (Winfield, Grayson & Scott, 2016). Aiming to enhance the worldwide protection of rosewood species, an updated list of Dalbergia species has been adopted by CITES (Convention on International Trade in Endangered Species of Wild Fauna and Flora) since 2017, which will support better protection for the diversity and health of rosewood populations (Hung et al., 2020). A complete infrageneric classification of the Dalbergia genus has been built by Bentham (George & Bentham, 1860), but a wide range of species and distribution makes it difficult to perform a comprehensive investigation (Carvalho, 1997;Chen, Zhang & Larsen, 2010;Niyomdham, 2002;Prain, 1904;Sunarno & Ohashi, 1997;Thothathri, 1987;Win, 2020). The first molecular phylogenetic framework of the Dalbergia genus has been established based on a limited number of Dalbergia species. However, the results suggest that the Dalbergia genus is monophyletic and originated from the New World (Vatanparast et al., 2013). Cui proposed a superior and comprehensive phylogenetic framework of the Dalbergia genus based on morphological traits, chloroplast and ITS DNA sequences, and more extensive samples (Cui, 2014;Li, 2017). Due to the higher posterior probability or bootstrap percentages of a phylogenomic tree, it could provide new insight into phylogenetic relationships among different taxa. In addition, a large number of cp genomes of Dalbergia and its relative taxa in Papilionoideae have been published in GenBank in recent years (Wariss et al., 2018;Deng et al., 2018;Xu et al., 2019;Liu et al., 2019;Song et al., 2019;Qin et al., 2020;Hong et al., 2020;Win et al., 2020). These published cp genomes may give us an opportunity to improve previous phylogenetic frameworks. Besides that, more genetic variation features from cp genome could provide new insight to understand the evolution and adaptation of Dalbergia plants in heterogeneous habitats, such as selective pressure and RNA editing site of chloroplast genes. RNA editing is the posttranscriptional modification of an RNA nucleotide sequence to produce more variation of transcripts. Previous reports suggest it may not only be involved in early evolution of land plants, but also be considered to be as a communication mechanism between chloroplast and nucleus to acclimatize a changing world (Schallenberg-Rüdinger & Knoop, 2016;Zhao, Huang & Chory, 2019).
In this article, six complete cp genomes of Dalbergia species (four tree species and two vine species) were de novo assembled by next-generation sequencing, and comparative genomic analysis of Dalbergia species and its related species in the subfamily of Papilionoideae (Fabaceae) was performed. The study aims to (1) obtain a few complete cp genomes of Dalbergia species; (2) reveal the cp genomic characteristics of Dalbergia species; (3) compare the genomic features of Dalbergieae species; and (4) establish an inferred phylogenetic relationship of Papilionoideae species.

Sample collections and DNA extraction
Fresh healthy leaves were collected from seedlings of three Dalbergia species (D. obtusifolia, D. hupeana and D. sissoo) (Doyle, 1987), and agarose gel electrophoresis and a one-drop spectrophotometer were used to detect DNA integrity and quality (OD-1000, Shanghai Cytoeasy Biotech Co., Ltd., Shanghai, China).

DNA sequencing, genome assembly and validation
We constructed shotgun libraries (150 bp) with genomic DNA and sequenced them on the BGI-500 platform. High-quality clean data were obtained by trimming the original data from both ends and removing the adapter and low-quality reads. Following quality filtering, we employed Bowtie2 v2.2.6 to map reads on a local cp genome database (Langmead & Salzberg, 2012). NOVOPlasty and CAP3 were used to de novo assemble the cp genomes with the starting sequence (Huang & Madan, 1999;Nicolas, Patrick & Guillaume, 2017). We used Blast, Hmmer 3, Aragorn and manual correction to predict gene, rRNA and tRNA sequences (Altschul et al. 1997;Dean & Bjorn, 2004;Eddy & Pearson, 2011). Organellar GenomeDRAW v1.3.1 was employed to draw a circular map of the cp genome, and CGV95 was adopted to visualize the annotation results (Lohse, Drechsel & Bock, 2007). Then, we adopted 34 primer pairs to prove junctions in six Dalbergia species cp genomes by PCR-based sequencing. A 20-µL reaction volume was set as the PCR program with a thermal cycler (Applied Biosystems, Foster, CA, USA): 2×Es Taq MasterMix (Dye) (CWbio, Beijing, China) with 10 µL, DNA with approximately 50 ng, forward primer and reverse primer with 5 pmol, respectively, and sterile double-distilled water was added to the 20 µL volume. We employed the following procedure to perform PCR amplification: 5 min of denaturation at 94 • C; 94 • C with 35 cycles of denaturation for 30 s, the optimal temperature to anneal for 30 s, followed by 30 s with 72 • C for extension; and then 5 min at 72 • C in the amplifications eventually for extension. After PCR amplification was completed, the amplified products were sequenced and compared with the assembled chloroplast genome (Table S1). Eventually, these accurate chloroplast genomes were submitted and stored in the NCBI GenBank (https://www.ncbi.nlm.nih.gov/, accession numbers MN714219-714222, MN905599 and MN833948). These newly assembled cp genomes of the Dalbergia genus were analysed for codon usage patterns. The protein-coding genes with more than 300 nucleotides were extracted to analyse the codon usage indices, including the relative synonymous codon usage (RSCU) and codon adaptation index (CAI), by using CodonW v1.4.4 (http://codonw.Sourceforge.net/). RSCU values can directly reflect codon usage bias. When RSCU values approach 1, all synonymous codons encoding the same amino acid were used equally. CAI refers to the adaptation index of all codons actually encoding the protein for the case where the optimal codon is used to encode the protein. It is also used to measure the level of gene expression. Higher CAI values implied a strong codon usage bias and a higher expression level (Sharp & Li, 1987).

Repeat sequence analyses in the cp genome
We adopted MIcroSAtellite (http://pgrc.ipk-gatersleben.de/misa/) to analyse simple sequence repeats (SSRs) in the cp genomes. The SSR motif types of mononucleotide, dinucleotide, trinucleotide tetranucleotide, pentanucleotide and hexanucleotide were set as 10, 6, 5, 5, 5, and 5 for the minimum repeat units, respectively, and we designed primer pairs by Primer 3 in the flanking region with all candidate loci (Andreas et al., 2012). We employed the Tandem Repeats Finder to screen tandem repeat sequences. Match -2, Mismatch -7, and Delta-7 were set as the alignment parameters, the minimum alignment score was set at 80, and the maximum period size was set as 500 (Benson, 1999). We used REPuter (https://bibiserv.cebitec.uni-bielefeld.de/reputerl) to analyse palindromic repeat sequences, complement repeats and dispersed repeat sequences, and 30 and 3 were set as the minimum repeat size and maximum base mismatch, respectively (Kurtz & Schleiermacher, 1999).

Gene selective pressure analysis
The genes under selection were detected by the Muscle (codon) employed in MEGA7 to align the sequences of the protein-coding gene separately in Dalbergia (Sudhir, Glen & Koichiro, 2016), and we used Fast-Tree 2.0 to build the maximum likelihood (ML) phylogenetic tree with 1,000 bootstraps based on complete cp genome sequences (Price, Dehal & Arkin, 2010). The site-specific model mainly assumes that different amino acid sites are subject to different selection pressures (regardless of the differences in selection pressure between different branches) in the dataset. This model is mainly employed to detect the effect of positive selection (ω > 1) in the CODEML algorithm (Price, Dehal & Arkin, 2010) employed in EasyCodeML (Gao et al., 2019) and consists of seven codon substitution models, containing M0 (one-ratio), M1a (nearly neutral), M2a (positive selection), M3 (discrete), M7 (beta), M8 (β& ω > 1) and M8a (β& ω = 1). The likelihood-ratio test was used to compare the fit of these models to the sequence data. Positively selected sites were detected by three site-specific models, M1a vs. M2a, M7 vs. M8, and M0 vs. M3 (Wan-Lin et al., 2018).

RNA editing sites
RNA editing was first found 30 years ago (Covello & Gray, 1989), and it also occurs within chloroplasts in plants (Small et al., 2020). RNA editing is one of the essential ways to regulate the expression of chloroplast genes at the posttranscriptional level, which causes nucleotide substitutions, deletions and insertions, thereby changing the coding information of the original transcript (Bentolila et al., 2013;Harris, Petersen-Mahrt & Neuberger, 2002). PREP (predictive RNA editor for plants, http://prep.unl.edu/) is an online program for predicting RNA editing sites, with a series of advantages, requiring minimal input, running fast, and high accuracy (Mower, 2009). A total of 35 protein-coding genes from the cp genome of Dalbergia species were submitted for predicting potential RNA editing sites with a cut-off value of 0.8.

Comparative analyses of cp genomes in Dalbergieae species
A total of 37 Dalbergieae species that are available in the NCBI database and six cp genomes of Dalbergia species were chosen for comparative genomic analysis. Then, the cp genomes were compared by mVISTA (Frazer et al., 2004) in shuffle-LANGAN mode with default parameters for other options, and D. obtusifolia was set as a reference. IRSCOPE was employed to analyse the boundaries between the four main regions of the annotated cp genomes to investigate the contraction or expansion of the IR regions (Ali, Jaakko & Peter, 2018). Noncoding regions (percentage of variability > 25%) and coding regions (percentage of variability > 8%) were selected acted as mutational hotspots in the comparative analyses of Dalbergia species. Percentage of variable = (the number of indels + number of nucleotide mutations)/(the length of aligned sites − the length of indels + the number of indels) × 100% was the formula (Dong et al., 2018).

Phylogenetic analysis
The phylogenetic relationships with the Papilionoideae subfamily members, including 165 cp genomes available in NCBI and six newly assembled genomes of the Dalbergia genus, were reconstructed in this study. A set of 77 common genes shared by these cp genomes were aligned by MAFFT v7 with the default settings. Maximum likelihood (ML) and maximum parsimony (MP) were performed to reconstruct the phylogenetic tree. All the gaps were excluded after alignment in two method analyses. ML was used in FastTree 2.0 (Price, Dehal & Arkin, 2010) with 1,000 bootstrap replicates and the GTR model of nucleotide substitution (Kazutaka, Daron & Standley, 2013). PAUP v4.0 (Swofford, 2002) was performed to construct an MP tree with 1,000 ratchet repetitions and a tree bisectionreconnection (TBR) branch swapping algorithm. Phylogenetic trees with bootstrap values (BS) were visualized using ITOL v6 (http://itol.embl.de/) (Letunic & Bork, 2021).  1, Table 1). The overall GC content of the IR region (42.8%) was higher than those of the LSC and SSC regions (approximately 33.7% and 29.5%, respectively) because of the high GC content in rRNAs,

Repeat structure and SSR analyses
We  (Table 1).
A total of 149-176 candidate cpSSR loci were detected in six Dalbergia species (Table 1). A majority of SSR loci were located in the LSC region (72.55%-76.82%), followed by the SSC region (17.88%-22.53%). These findings indicated that the distribution of cpSSRs was imbalanced in the genus Dalbergia. Then, mono-, di-and trinucleotide SSRs were detected in each Dalbergia species, and the results showed that the majority of the mononucleotide repeats were A or T repeats. The majority of the dinucleotide repeat sequences consisted of AT or TA repeats, while a few TC repeats were also screened (Table S3).

RNA editing site
We analysed the full set of chloroplast coding sequences, and the results showed that the number of RNA editing sites ranged from 32 to 43 in those six Dalbergia species, involving 18-20 protein-coding genes (Table 4). All the predicted RNA editing sites are the conversion of cytosine (C) to thymine (T) and may have caused amino acid changes. A majority of RNA editing occurs in the second codon, and a few occur in the first position of the codon. We also found that the conversion of amino acids caused by RNA editing is mostly from serine (S) to leucine (L). These changes could reverse protein polarity and affect hydrophobicity (Pinard, Myburg & Mizrachi, 2019;Rabah et al., 2017). The results showed that almost half of the RNA editing sites were located in the ndh genes; the number of RNA editing sites in the ndhB gene was the largest, with 10, followed by ndhA (5), ndhF (4), ndhG (4), rps2 (3), accD (2), atpI (2), ccsA (2), matK (2), ndhD (2), psbL (2), rpoA (2), and rps16 (2), and the other genes had only one editing site (Table S5). Moreover, we also observed that over half of the conventions at the codon positions changed from S (serine) to L (leucine).

Sequence divergence in Dalbergieae species
We performed a BLAST analysis of the complete sequences of 43 cp genomes (21 Dalbergia spp., three Stylosanthes spp., 13 Arachis spp. and 6 Pterocarpus spp.) using mVISTA, and the highly divergent regions are shown in Fig. 2; the result showed a high sequence similarity across the cp genomes, with a sequence identity over 70.0%, and the variability of protein-coding regions was less than those of noncoding regions. Twelve regions within the noncoding regions (rps16-accD, trnR-  and 4 regions within the coding regions (rps16, ycf1, rps15 and ndhF ) showed greater levels of variation (percentage of variability >25% and 8%, respectively) (Fig. 3).
The inverted repeat and single-copy (IR/SC) boundary regions of cp genomes were examined and are illustrated in Fig. 4, and the results showed that the boundary of the LSC/IRs was highly conserved, whereas the borders of IRs/SSC were highly variable in Dalbergieae species. First, the ndhF gene and ycf1 pseudogene crossed in the SSC/IRb regions within the two parts for all the Arachis and Stylosanthes species. Second, the ycf1 gene was complete in the IRb region for all Pterocarpus species and most Dalbergia species. Third, the ycf1 gene crossed the boundary of the SSC/IRa region for all four genera mentioned above.

Phylogenetic relationship of Papilionoideae species
We screened a total of 77 common genes from the cp genomes of Papilionoideae species to build a phylogenetic tree by performing the maximum likelihood (ML) and maximum parsimony (MP) methods. All the ML and MP trees were highly congruent in identifying these Papilionoideae species of the phylogenetic position. The results showed an inferred phylogenetic tree of Papilionoideae species with high bootstrap support (Fig. 5). Twentyone species of Dalbergia formed a monophyletic group with high support (BS = 100 for ML and MP) and were resolved as an early diverging lineage from the Dalbergieae clade. Dalbergia species were clustered into the Supertr. Dalbergiodae with the Amorpheae clade also suggested that Amorpheae was a sister to the Dalbergieae clade. In addition, three

Genetic variation in cp genomes
We determined the complete cp genomes of six Dalbergia species. These cp genomes have a typical quadripartite structure, and the variation in cp genome sizes for Dalbergia species is mainly caused by the sizes of the two single-copy regions. The GC content of SSC regions is lower than that of IR regions, which may result from the presence of rRNA genes in IR regions (Asaf et al., 2016). Similar to most land plants, the protein-coding genes of the cp genome in Dalbergia species are highly conserved (Daniell et al., 2016).
In addition, we identified over 140 SSRs in six Dalbergia species; approximately 70% of the SSRs were distributed in the LSC regions, and the majority of these SSRs were monoand dinucleotide repeats. SSRs are widely distributed across nuclear and plastid genomes and have a great influence on genome recombination and rearrangement (Guisinger et al., 2010). Additionally, the results suggested that the variation in the boundary of the SC/IR region contributed to the size variation in cp genomes in Dalbergieae species. In the Dalbergia species, noncoding regions (percentage of variability > 25%) and coding regions (percentage of variability > 8%) had a large degree of variation and acted as mutational hotspots. The percentage of variation is equal to the sum of indels and the number of nucleotide mutations divided by the length of the alignment site minus the length of indels plus the number of indels, and the final result is multiplied by 100% using the formula ( Dong et al., 2018). 12 noncoding regions and 4 coding regions showed greater levels of variation. These sequences could be used to develop potential molecular tools for further study of phylogenetic relationships and population genetics.

RNA editing sites
RNA editing was first found 30 years ago (Covello & Grey, 1989), and it also occurs within chloroplasts in plants (Small et al., 2020). RNA editing is one of the essential ways to regulate the expression of chloroplast genes at the posttranscriptional level (Bentolila et al., 2013;Harris, Petersen-Mahrt & Neuberger, 2002). RNA editing is a posttranscriptional process that may trigger changes in coding information from original transcripts (Takenaka et al., 2013). Therefore, identifying RNA editing sites in the cp genome will provide us with more information on evolutionary dynamics. The results showed that 12 common genes contained potential RNA editing sites in Dalbergia species, accounting for 54.54% of the total. These findings indicated that the variation model of RNA editing sites in cp genomes is relatively conservative for the genus Dalbergia. We also found that most of the conversions at the codon could lead to amino acid changes from polar to apolar and could result in an increase in protein hydrophobicity (Pinard, Myburg & Mizrachi, 2019).

Adaptive evolution of chloroplast genes
We used a site-specific model to estimate the selection pressure, and 22 genes with positive selection sites were found in Dalbergia species. The ycf1 gene is one of the largest genes encoding a part of the chloroplast's inner envelope membrane protein translocon (Kikuchi et al., 2013), and the ycf1 gene in the genus Dalbergia has been shown to be subject to positive selection from 15 sites. Three NADH dehydrogenase subunit genes (ndhC, ndhD and ndhF ) possessed at least two positively selected sites, implying that these family members were potentially under positive selective pressure in Dalbergia species. NADHdehydrogenase subunits are important in the electron transport chain for the generation of ATP and are essential components for photosynthesis in plants (Weiss et al., 1991). Additionally, we found that the rbcL gene possessed four sites under positive selection. This gene encodes the large subunit of Rubisco protein, which is an important component as a modulator of photosynthetic electron transport (Allahverdiyeva et al., 2005;Piot et al., 2018). The positive selection of the rbcL gene may be a common phenomenon in land plants (Kapralov & Filatov, 2007). The ycf2 gene, as a giant reading frame, possesses three sites under positive selection, and its function is still unknown (Sobanski et al., 2019). The accD gene encodes the β-carboxyl transferase subunit of acetyl-CoA carboxylase, which acetyl-CoA carboxylase catalyses the first and rate-limiting step of lipid biosynthesis (Sobanski et al., 2019). Two positively selected sites were identified in the accD gene. It is believed that these positively selected genes play a key role in adapting to different environments in plant evolutionary processes. In addition, the wide pantropical distributions and heterogeneous habitats might have increased the rates of evolution and speciation of Dalbergia species for greater adaptation (Hu et al., 2015).

Phylogenetic relationship
We reconstructed a phylogenetic tree using the ML method and MP method based on 77 common genes of Papilionoideae species. The inferred tree has four clear clades and was consistent with previous reports (Zhang et al., 2020;Legume Phylogeny Working Group (LPWG), 2017;Zhao et al., 2021), and the phylogenetic relationships of some Dalbergia species were consistent with those of previous studies (Vatanparast et al., 2013;Cui, 2014). In our study, all Dalbergia species were divided into two main clades; one clade contained seven species (D. hupeana, D. balansae, D. obtusifolia, etc.), and most of these species were diadelphous and widely distributed in southeast Asia and southern China. The other clade contained 14 species, which could be grouped into several subclades. D. hancei, D. mimosoides and D. cultara were placed into one subclade with a higher bootstrap percentage. The topology structure of our phylogenetic tree was in accordance with previous phylogenetic relationships approximately. However, there are different and new advances of phylogenetic tree reconstruction. D. obtusifolia is a tree species and distributed in the Southwest and South Yunnan in China. The species was located at different positions in phylogenetic trees based on chloroplast DNA, nuclear DNA and their combined sequences (Cui, 2014). Our results supported the species was clustered together with D. cochinchinensis, and further constituted a large subclade combing D. barienesis, D. oliveri, D. hainanensis, D. balansae and D. hupeana. These Dalbergia species distributed continuously from a tropical area of Indochina peninsula to subtropical area of East Asia. The Dalbergia branch had a strongly supported topology and showed that some species have a close evolutionary relationship, e.g., D. odorifera and D. tonkinensis and D. hupeana and D. balansae. However, species delimitation is an interesting issue that has always attracted renewed attention. D. barienesis has been treated as a synonym of D. oliveri, but the genetic divergence of its cp genomes was higher than those of two species (D. hupeana and D. balansae). Therefore, we still need further genome-wide knowledge, especially to understand the process of speciation for relatives in the Dalbergia genus.

CONCLUSIONS
Herein, little difference was found in the genome size of the six sequenced cp genomes of Dalbergia spp. The gene content was relatively conserved, while IR boundary was highly variable in Dalbergieae species. Meanwhile, a number of cpSSRs and hotspots of nucleotides variation were screened, and selective pressure and RNA editing site of chloroplast genes also were identified in Dalbergia cp genomes. This will promote our understanding of their genetic variation features. Besides that, the reconstruction phylogenetic framework of chloroplast genome elucidated the relationships among species in the subfamily of Papilionoideae (Fabaceae). It also supported and improved a previous phylogenetic framework of Dalbergia genus based on chloroplast and nuclear DNA sequences. This indicated phylogenomic framework based on cp genome has advantages in inferring phylogenetic relationships of plants.